Face dissimilarity judgments are predicted by representational distance in morphable and image-computable models

Significance
Discerning the subtle differences between individuals’ faces is crucial for social functioning. It requires us not only to solve general challenges of object recognition (e.g., invariant recognition over changes in view or lighting) but also to be attuned to the specific ways in which face structure varies. Three-dimensional morphable models based on principal component analyses of real faces provide descriptions of statistical differences between faces, as well as tools to generate novel faces. We rendered large sets of realistic face pairs from such a model and collected similarity and same/different identity judgments. The statistical model predicted human perception as well as state-of-the-art image-computable neural networks. Results underscore the statistical tuning of face encoding.


Basel Face Model.
For face pairs in which the cosine distance was undefined because one face lay at the origin of BFM space, the angle between the two faces was defined as zero for the purposes of model evaluation. To more fully explore the relationship between apparent dissimilarity and the placement of faces in the full BFM space, we also considered linear and sigmoidal functions as candidates for describing the relationship between Euclidean distance in the BFM and face dissimilarity judgments. We estimated each model's predictive performance as the Pearson correlation between the fitted model's predicted dissimilarities and the dissimilarities reported by each participant. We tested for significant differences between the linear and sigmoidal fits using a two-sided Wilcoxon signed-rank test. For each participant, we fitted each model to half of the data (session 1) and measured its predictive accuracy on the other half (session 2). Predictive accuracies were averaged across participants.
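Below is a minimal sketch of this cross-validated curve fitting in Python, assuming a three-parameter logistic form for the sigmoid; the exact parameterization, toy data, and variable names are illustrative assumptions rather than the procedure used in the study.

```python
# Minimal sketch: fit linear and sigmoidal mappings from BFM Euclidean distance
# to dissimilarity on session 1, then evaluate Pearson correlation on session 2.
# The sigmoid parameterization and all variable names are illustrative assumptions.
import numpy as np
from scipy.optimize import curve_fit
from scipy.stats import pearsonr

def sigmoid(d, amplitude, slope, midpoint):
    return amplitude / (1.0 + np.exp(-slope * (d - midpoint)))

def linear(d, gain, offset):
    return gain * d + offset

def cross_validated_r(dist_s1, judg_s1, dist_s2, judg_s2, fn, p0):
    # Fit on session 1, evaluate on held-out session 2.
    params, _ = curve_fit(fn, dist_s1, judg_s1, p0=p0, maxfev=10000)
    return pearsonr(fn(dist_s2, *params), judg_s2)[0]

# Toy example with synthetic data for a single participant.
rng = np.random.default_rng(0)
d1, d2 = rng.uniform(0, 80, 200), rng.uniform(0, 80, 200)
j1 = 1.0 / (1.0 + np.exp(-0.1 * (d1 - 40))) + rng.normal(0, 0.05, 200)
j2 = 1.0 / (1.0 + np.exp(-0.1 * (d2 - 40))) + rng.normal(0, 0.05, 200)
print(cross_validated_r(d1, j1, d2, j2, sigmoid, p0=[1.0, 0.1, 40.0]))
print(cross_validated_r(d1, j1, d2, j2, linear, p0=[0.01, 0.0]))
```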
Finally, the BFM provides the axes onto which the height, weight, age, and gender of the 3D-scanned participants load most strongly. By projecting new face points onto these axes, we can approximately estimate the height, weight, age, and gender of each generated face. The predicted dissimilarity of the "Person attributes" model was the Euclidean distance between faces after projecting them onto these four dimensions.
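A minimal sketch of this distance computation is shown below, assuming the four attribute axes are available as row vectors in the BFM latent space; array names and dimensionalities are illustrative, not taken from the study.

```python
import numpy as np

def person_attribute_distance(face_a, face_b, attribute_axes):
    """Euclidean distance between two faces after projecting each onto the
    height, weight, age, and gender axes. attribute_axes is a 4 x n_dims
    matrix of (assumed unit-norm) axis vectors; all names are illustrative."""
    proj_a = attribute_axes @ face_a  # 4-dimensional attribute estimate
    proj_b = attribute_axes @ face_b
    return np.linalg.norm(proj_a - proj_b)

# Toy usage with random vectors standing in for BFM latents and attribute axes.
rng = np.random.default_rng(0)
axes = rng.standard_normal((4, 398))
axes /= np.linalg.norm(axes, axis=1, keepdims=True)
face_a, face_b = rng.standard_normal(398), rng.standard_normal(398)
print(person_attribute_distance(face_a, face_b, axes))
```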
Models based on 3D face structure.
We selected 30 vertices on each face corresponding to key locations such as the centre and edges of each eye, the edges of the mouth, nose, jaw, chin, and hairline, using data provided in the BFM. The positions of these 30 vertices on each 3D face mesh formed the features for the "0th order" configural model. We then calculated 19 distances between horizontally and vertically aligned features (e.g., width of nose, length of nose, separation of eyes), which formed the "1st order" configural model. Finally, we calculated 19 ratios among these distances (e.g., the ratio of eye separation to eye height; the ratio of nose width to nose length), which formed the "2nd order" configural model. For all configural models, the predicted dissimilarity between two faces was the Euclidean distance between their respective feature vectors.
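The sketch below illustrates how the three configural feature sets and the resulting model-predicted dissimilarity could be computed; the landmark, distance, and ratio index pairs are placeholders rather than the exact pairs used in the study.

```python
import numpy as np

def configural_features(landmarks, distance_pairs, ratio_pairs):
    """landmarks: (30, 3) array of 3D landmark positions for one face.
    distance_pairs: 19 (i, j) landmark index pairs (e.g., nose width).
    ratio_pairs: 19 (k, l) index pairs into the distance vector.
    The index pairs used below are placeholders, not the study's actual choices."""
    order0 = landmarks.ravel()                                          # 0th order: raw positions
    order1 = np.array([np.linalg.norm(landmarks[i] - landmarks[j])
                       for i, j in distance_pairs])                     # 1st order: distances
    order2 = np.array([order1[k] / order1[l] for k, l in ratio_pairs])  # 2nd order: ratios
    return order0, order1, order2

def configural_dissimilarity(features_a, features_b):
    # Predicted dissimilarity: Euclidean distance between feature vectors.
    return np.linalg.norm(features_a - features_b)

# Toy usage with random landmarks and placeholder index pairs.
rng = np.random.default_rng(0)
lm_a, lm_b = rng.standard_normal((30, 3)), rng.standard_normal((30, 3))
d_pairs = [(i, i + 1) for i in range(19)]
r_pairs = [(i, (i + 1) % 19) for i in range(19)]
for fa, fb in zip(configural_features(lm_a, d_pairs, r_pairs),
                  configural_features(lm_b, d_pairs, r_pairs)):
    print(configural_dissimilarity(fa, fb))
```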
Deep neural networks.
We formed the VGG-BFM-identity classification network by training the VGG-16 architecture (1, TorchVision's implementation) to classify Basel Face Model (2) face images of 8,631 synthetic identities (Supplementary Figure 14). All of the images pertaining to one identity shared shape and texture latents (both randomly sampled from the Basel Face Model, once per identity) but had different expression latents, poses, lighting directions, and lighting intensities. We generated 363 images of each identity to roughly match the total number of training images in the VGGFace2 dataset (3). The rendered images were randomly cropped during training and centre-cropped during validation, in both cases yielding an input image of 224 × 224 pixels. To further increase the images' variability, we augmented the training examples using Albumentations (4). We included only naturalistic transformations, such as grayscale transformation, brightness and contrast manipulations, noise addition, drop-out, grid distortion, and blurring. See Supplementary Figure 16a for examples of training-set images. The input images were normalized by the channel-specific mean and standard deviation, computed from a subset of training images. The model was trained for 30 epochs to minimize the cross-entropy loss, using four GPUs. We used stochastic gradient descent with a weight decay of 0.0001, a momentum of 0.9, and a minibatch size of 512. The learning rate was initialised to 0.01 and reduced by a factor of 10 every 10 epochs. The model reached a validation loss of 0.0.
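The following PyTorch sketch summarizes the stated training configuration; dataset construction, the Albumentations pipeline, and multi-GPU handling are omitted, and the snippet illustrates the reported hyperparameters rather than reproducing the exact training code.

```python
# Schematic of the VGG-BFM-identity training configuration (PyTorch).
# train_loader is assumed to yield (image, identity_label) minibatches of 512
# channel-normalized 224 x 224 crops; it is not defined here.
import torch
import torch.nn as nn
from torchvision import models

n_identities = 8631
model = models.vgg16(num_classes=n_identities)  # randomly initialized weights

criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01,
                            momentum=0.9, weight_decay=1e-4)
# Reduce the learning rate by a factor of 10 every 10 epochs.
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=10, gamma=0.1)

def train(train_loader, n_epochs=30):
    model.train()
    for epoch in range(n_epochs):
        for images, labels in train_loader:
            optimizer.zero_grad()
            loss = criterion(model(images), labels)
            loss.backward()
            optimizer.step()
        scheduler.step()
```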
We formed the VGG-BFM-latents regression network by setting the last fully connected layer of a VGG-16 architecture to map the penultimate layer to a 508-dimensional vector, and training the network to recover the underlying latents of synthetic face images sampled from the Basel Face Model (2, Supplementary Figure 15). Of the 508 output units, 199 were assigned as predicted shape coefficients, 199 as predicted texture coefficients, 100 as predicted expression coefficients, 4 as predicted face pose (parameterized as a quaternion), 3 as lighting colour, and 3 as lighting direction. The network was trained to minimize the sum of six normalized mean squared error (NMSE) terms (i.e., shape, texture, expression, face pose, ambient lighting, and lighting direction). The synthetic dataset included 3,300,000 unique faces generated similarly to the dataset used to train the VGG-BFM-identity model, except that each face was independently sampled from the BFM (i.e., without using synthetic identities) and we did not use dataset augmentations other than random cropping. The model was trained for 120 epochs using four GPUs. We used Adam (5) with β₁ = 0.9, β₂ = 0.999, ε = 10⁻⁸, no weight decay, and a minibatch size of 512. The learning rate was initialised to 0.0001 and was reduced by a factor of 10 every 40 epochs. See Supplementary Figure 16b,c for a quantitative and qualitative evaluation of the trained model's performance.
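A sketch of the summed NMSE objective is given below, assuming the 508 output units are ordered as listed above and that each term is normalized by the variance of the corresponding ground-truth latents, so that an optimal constant prediction yields an expected NMSE of 1.0; the slice boundaries and names are illustrative assumptions.

```python
# Sketch of the summed normalized-MSE objective of the VGG-BFM-latents model.
# The assumed ordering of the 508 output units (shape, texture, expression,
# pose, lighting colour, lighting direction) is an illustrative choice.
import torch

LATENT_SLICES = {
    "shape": slice(0, 199), "texture": slice(199, 398),
    "expression": slice(398, 498), "pose": slice(498, 502),
    "light_colour": slice(502, 505), "light_direction": slice(505, 508),
}

def nmse_loss(pred, target, target_var):
    """pred, target: (batch, 508) tensors of predicted and ground-truth latents.
    target_var: per-group variance of the ground-truth latents (estimated from
    the training set), so an optimal constant prediction gives NMSE = 1.0."""
    total = 0.0
    for name, sl in LATENT_SLICES.items():
        mse = torch.mean((pred[:, sl] - target[:, sl]) ** 2)
        total = total + mse / target_var[name]
    return total
```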
Supplementary Fig. 4. Uniformity test. a) Schematic of the distinction between perceptual isotropy and perceptual uniformity. If a space is perceptually isotropic (left), the pairs of faces (A,B) and (A',B') should appear equally dissimilar, because they correspond to vectors that have been rotated around the origin while preserving their geometric relationship to one another (the vectors span the same angle θ and have the same norms r1 and r2). If a space is perceptually uniform (right), the pairs of faces (A,B) and (A',B') should appear equally dissimilar, because they correspond to vectors that have been linearly translated in the space, preserving their Euclidean distance to one another (while disrupting their geometric relationship). b) Analysis evaluating evidence of perceptual uniformity in the stimulus set A experiment. Face pairs were binned into groups with similar Euclidean distances. We then evaluated whether the angle between the faces explains variance in perceived dissimilarity within each bin. If the space is non-uniform, we might expect faces with larger angular differences to appear more different, even if they have identical Euclidean distance. We find only weak evidence for any non-uniformity in the face space. Bins with significant correlation are indicated by an asterisk (one-sided Wilcoxon signed-rank test, P < 0.05 corrected). Error bars show the standard error of the mean based on single-participant correlations.

Supplementary Fig. 9. Statistical comparisons between models after allowing a sigmoidal transformation to fit human dissimilarity data. a) Model performance data shown in Figure 4b, but with models ordered and statistically compared according to their performance after fitting a sigmoidal transform (within cross-validation folds) to raw model-predicted distances. Conventions are as in Figure 4b. b) Corresponding data for stimulus set B.

Supplementary Fig. 15. Training procedure of the VGG-BFM-latents model. Each training or test sample was independently sampled from the Basel Face Model distribution and rendered using randomly sampled pose, light direction, light intensity, and background colour. The 3D rendering was followed by random cropping for training images or centre cropping for test images. The VGG-16 architecture was initialized with random weights and trained to recover the latents underlying the input image, minimizing the sum of six normalized mean squared error (NMSE) terms, pertaining to BFM shape, texture, expression, pose (parameterized by quaternions), light direction, and light intensity. These error terms are computed from the squared differences between the ground-truth latents and six subsets of the 508-dimensional output layer. Each term was normalized such that an optimal constant prediction would result in an expected NMSE of 1.0.
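As an illustration of the binning analysis described in Supplementary Fig. 4b, the sketch below bins face pairs by BFM Euclidean distance, correlates angular difference with reported dissimilarity within each bin, and tests the single-participant correlations per bin; the variable names, bin count, and the absence of a multiple-comparisons correction are illustrative simplifications.

```python
# Sketch of the uniformity analysis of Supplementary Fig. 4b; all names and the
# quantile-based binning are illustrative simplifications.
import numpy as np
from scipy.stats import pearsonr, wilcoxon

def within_bin_correlations(euclid, angle, dissim, n_bins=10):
    # euclid, angle, dissim: arrays over face pairs for one participant.
    edges = np.quantile(euclid, np.linspace(0.0, 1.0, n_bins + 1))
    bins = np.clip(np.digitize(euclid, edges) - 1, 0, n_bins - 1)
    return np.array([pearsonr(angle[bins == b], dissim[bins == b])[0]
                     for b in range(n_bins)])

def bin_p_values(corr_matrix):
    # corr_matrix: participants x bins array of single-participant correlations;
    # one-sided Wilcoxon signed-rank test per bin (uncorrected in this sketch).
    return [wilcoxon(corr_matrix[:, b], alternative="greater")[1]
            for b in range(corr_matrix.shape[1])]
```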